118        Bioinformatics

FASTQ files). When the FASTQ files have been downloaded successfully, as shown in Figure 4.3, we can use “ls fastq” to display the files in the newly created directory.

4.2.1.1.2  Downloading and indexing the reference genome sequence

The reads in the FASTQ files must be aligned to a reference genome of the organism studied. Therefore, the latest FASTA file of the reference genome sequence is downloaded from a genome database to a local drive. The NCBI Genome database “https://www.ncbi.nlm.nih.gov/genome/” is one of the databases that curate reference genome sequences. We can use the database query box to search for the latest reference genome of “SARS-CoV-2” and copy the link to the FASTA sequence of the reference genome. The following bash script creates the “ref” subdirectory, downloads the compressed FASTA sequence of the latest SARS-CoV-2 reference genome into that subdirectory, and decompresses the FASTA file. Notice that the URL of the reference sequence may change if a new version becomes available; therefore, visit the reference genome page for the latest sequence. When you use the following script, make sure that “wget” and the URL are on the same line and that there is no whitespace in the URL. After downloading the reference sequence, use “ls” to make sure the file has been downloaded and decompressed.

mkdir ref

cd ref

wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3/GCF_009858895.2_ASM985889v3_genomic.fna.gz

f=$(ls *.*)

gzip -d ${f}
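The download step above can also be sketched as a pair of small bash functions that stop at the first failure and never need “cd”. This is a minimal sketch, not a definitive implementation: the function names “fasta_name_from_url” and “download_reference” are hypothetical helpers introduced here for illustration, and REF_URL is the URL quoted in the text, which may change when a new assembly is released.

```shell
#!/usr/bin/env bash
# Sketch: download a gzip-compressed reference FASTA and decompress it.
# REF_URL is the URL quoted in the text; check the genome page for the current one.
REF_URL="https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3/GCF_009858895.2_ASM985889v3_genomic.fna.gz"

# Hypothetical helper: derive the decompressed file name from a URL
# by taking the last path component and dropping the ".gz" suffix.
fasta_name_from_url() {
    basename "$1" .gz
}

# Hypothetical helper: fetch the archive into an output directory
# (default "ref"), decompress it, and print the path of the FASTA file.
download_reference() {
    local url="$1" outdir="${2:-ref}"
    local gz_name
    gz_name=$(basename "$url")
    mkdir -p "$outdir" || return 1
    # -O writes to an explicit path, so the working directory is unchanged
    wget -q -O "$outdir/$gz_name" "$url" || return 1
    # -d decompresses; -f overwrites a FASTA left over from an earlier run
    gzip -d -f "$outdir/$gz_name" || return 1
    echo "$outdir/$(fasta_name_from_url "$url")"
}
```

With these definitions loaded, running download_reference "$REF_URL" ref would be expected to leave the decompressed FASTA file in the “ref” subdirectory and print its path, assuming the URL is still valid and the network is reachable.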

4.2.1.1.3  Indexing the FASTA file of the reference genome

As discussed in Chapter 2, we need to index the FASTA sequence of the reference genome with both “samtools faidx” and the aligner used for mapping. In this example, we will use the “bwa” aligner; therefore, we will also index the reference with “bwa index”. The following bash script uses “samtools” and “bwa” to index the reference genome:

f=$(ls *.*)

samtools faidx ${f}

bwa index ${f}

cd ..
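One way to confirm that both indexing commands finished is to check for the files they write next to the reference: “samtools faidx” writes a “.fai” file, and “bwa index” writes files ending in “.amb”, “.ann”, “.bwt”, “.pac”, and “.sa”. The following is a minimal sketch of such a check; the function name “missing_index_files” is a hypothetical helper introduced here for illustration.

```shell
#!/usr/bin/env bash
# Sketch: list the index files that are still missing for a reference FASTA.
# .fai is written by "samtools faidx"; the other five are written by "bwa index".
missing_index_files() {
    local fasta="$1" ext
    for ext in fai amb ann bwt pac sa; do
        # -e is true when the file exists; print the name only if it does not
        [ -e "${fasta}.${ext}" ] || echo "${fasta}.${ext}"
    done
}
```

Calling missing_index_files with the path of the FASTA file in “ref” prints nothing when all six index files are present; any file name it prints points to an indexing step that should be rerun.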

If you have followed the above steps successfully, displaying the content of the “ref” subdirectory should show the following files:

GCF_009858895.2_ASM985889v3_genomic.fna

GCF_009858895.2_ASM985889v3_genomic.fna.amb

GCF_009858895.2_ASM985889v3_genomic.fna.ann

GCF_009858895.2_ASM985889v3_genomic.fna.bwt